Green-Thread Actor Runtime
Erlang's isolation model. Rust's zero-copy ownership. No function colouring.
smarm is a prototype concurrent runtime for Rust. Each actor is a green thread with its own
mmap'd stack. N OS threads share a single global run queue. Actors communicate
exclusively via message passing (owned values over channels); no shared mutable state
without an explicit Arc<Mutex<T>>.
Preemption is allocator-driven: every Nth heap allocation, smarm reads RDTSC and yields the actor if its timeslice has expired. No OS signals, no separate timer thread for scheduling.
No function colouring. No Box<dyn Future>. No poll state machines. Just plain Rust functions that block.
64 KB stacks instead of 8 MB. A context switch takes ~10–20 ns (6 GPR saves + ret) and never enters kernel mode.
Zero-copy ownership via Rust's type system. No GC pause. No copying GC. Message passing is a move, not a clone.
Module Map
13 source modules in three rough layers. The bottom layer has zero smarm dependencies; the middle layer builds the runtime machinery; the top layer is the public API.
Calls mmap for a contiguous region, then mprotect's the bottom page to PROT_NONE. Stack grows downward; overflow hits the guard page → SIGSEGV. Implements Drop via munmap. Zero smarm dependencies.
Two #[naked] assembly functions (switch_to_actor, switch_to_scheduler). Save 6 callee-saved GPRs, swap rsp, restore, ret.
Thread-locals hold each side's saved stack pointer. XMM registers not
saved here — compiler guarantees spill at Rust call sites.
Implements GlobalAlloc — wraps System allocator. On every Nth alloc, reads RDTSC. If elapsed > timeslice_cycles and preemption is enabled, calls switch_to_scheduler(). Thread-locals hold the countdown, start timestamp, and an enabled flag (scheduler disables it to prevent self-preemption).
struct Pid(u32 index, u32 generation). Index = slot in the actor table. Generation increments on actor death. Stale handles are detectable: a Pid with wrong generation fails slot lookup rather than silently addressing a new actor. Solves ABA without exhausting PID space.
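The generation check can be sketched with a small slot table. This is a minimal model under stated assumptions, not smarm's actual code: `SlotTable`, `Slot`, and the `insert`/`get`/`remove` methods are hypothetical names, and `T` stands in for whatever per-actor state the real table holds.

```rust
/// Hypothetical model of a generational handle; field names follow the text above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pid {
    pub index: u32,
    pub generation: u32,
}

/// One entry in the actor table. `T` stands in for the per-actor state.
struct Slot<T> {
    generation: u32,
    value: Option<T>,
}

/// Illustrative slot table (assumed shape, not smarm's real type).
pub struct SlotTable<T> {
    slots: Vec<Slot<T>>,
    free: Vec<u32>,
}

impl<T> SlotTable<T> {
    pub fn new() -> Self {
        SlotTable { slots: Vec::new(), free: Vec::new() }
    }

    /// Reuse a free slot if one exists, otherwise grow the table.
    pub fn insert(&mut self, value: T) -> Pid {
        if let Some(index) = self.free.pop() {
            let slot = &mut self.slots[index as usize];
            slot.value = Some(value);
            Pid { index, generation: slot.generation }
        } else {
            let index = self.slots.len() as u32;
            self.slots.push(Slot { generation: 0, value: Some(value) });
            Pid { index, generation: 0 }
        }
    }

    /// Lookup fails for a stale handle because the generations differ.
    pub fn get(&self, pid: Pid) -> Option<&T> {
        let slot = self.slots.get(pid.index as usize)?;
        if slot.generation == pid.generation {
            slot.value.as_ref()
        } else {
            None
        }
    }

    /// Actor death: bump the generation so old handles stop matching.
    pub fn remove(&mut self, pid: Pid) -> Option<T> {
        let slot = self.slots.get_mut(pid.index as usize)?;
        if slot.generation != pid.generation {
            return None;
        }
        let value = slot.value.take()?;
        slot.generation = slot.generation.wrapping_add(1);
        self.free.push(pid.index);
        Some(value)
    }
}
```

A handle obtained before `remove` keeps its old generation, so a later `get` with it returns `None` even when the index has been reused by a new actor.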
Owns the Stack. Defines the trampoline: every actor's first ret lands here. Trampoline reads the closure from a thread-local, calls it inside catch_unwind, writes the Outcome
to another thread-local, then yields back to the scheduler.
Thread-locals: current PID, pending closure, last outcome, done flag.
The heaviest module. Contains SharedState (slot table, run queue, timers, IO), RuntimeInner (shared state behind a mutex, per-thread stats, drain lock), and schedule_loop
— the main scheduler loop that drains timers, drains IO completions,
pops actors, resumes them, and handles the post-yield intent (re-queue
vs park vs finalize).
Unbounded MPSC. Inner state is Arc<Mutex<Inner<T>>> — senders are clonable, last drop closes channel. recv(): checks queue; if empty, registers self as parked_receiver, releases the lock, calls park_current(). send(): pushes, takes the parked PID, calls unpark(pid).
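The recv/send handshake can be modelled without a scheduler. In this sketch the `Arc<Mutex<..>>` wrapper and the real park/unpark calls are left out: `send` returns the PID the caller would pass to unpark(), and `recv_or_register` reports whether the caller would have to park. `Recv`, `recv_or_register`, and the `u32` PID alias are assumptions made for illustration.

```rust
use std::collections::VecDeque;

/// Stand-in for the real Pid type.
pub type Pid = u32;

/// Channel state as described above, minus the Arc<Mutex<..>> wrapper.
pub struct Inner<T> {
    queue: VecDeque<T>,
    parked_receiver: Option<Pid>,
}

/// Result of a receive attempt in this model.
pub enum Recv<T> {
    /// A message was available.
    Ready(T),
    /// The queue was empty; the caller is registered and should park.
    MustPark,
}

impl<T> Inner<T> {
    pub fn new() -> Self {
        Inner { queue: VecDeque::new(), parked_receiver: None }
    }

    /// Push a message and return the PID to unpark, if a receiver was parked.
    pub fn send(&mut self, value: T) -> Option<Pid> {
        self.queue.push_back(value);
        self.parked_receiver.take()
    }

    /// Pop a message, or register `me` as the parked receiver.
    pub fn recv_or_register(&mut self, me: Pid) -> Recv<T> {
        match self.queue.pop_front() {
            Some(value) => Recv::Ready(value),
            None => {
                self.parked_receiver = Some(me);
                Recv::MustPark
            }
        }
    }
}
```

Because `send` takes the parked PID out of the slot, only the first send after a park produces a wakeup; later sends just queue.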
Actor-aware mutex with mandatory timeout (default 30s). Fast
path: no holder → grant immediately. Slow path: join FIFO waiter queue,
insert a WaitTimeout timer, park. On timer expiry: if actor is still in waiters, unpark it with LockTimeout. On guard drop: pop next waiter, grant, unpark.
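The fast path, the FIFO slow path, and the timeout callback can be sketched as a plain state machine. This is a model under assumptions: `LockState` and its method names are hypothetical, PIDs are plain `u32`s, and parking, timers, and the guard type are reduced to return values.

```rust
use std::collections::VecDeque;

/// Stand-in for the real Pid type.
pub type Pid = u32;

/// Hypothetical lock bookkeeping: current holder plus FIFO waiters.
#[derive(Default)]
pub struct LockState {
    holder: Option<Pid>,
    waiters: VecDeque<Pid>,
}

impl LockState {
    /// Fast path: no holder, grant immediately. Slow path: join the FIFO queue.
    /// Returns true if `me` now holds the lock, false if it must park.
    pub fn lock(&mut self, me: Pid) -> bool {
        if self.holder.is_none() {
            self.holder = Some(me);
            true
        } else {
            self.waiters.push_back(me);
            false
        }
    }

    /// Guard drop: hand the lock to the next waiter and return its PID to unpark.
    pub fn unlock(&mut self) -> Option<Pid> {
        self.holder = self.waiters.pop_front();
        self.holder
    }

    /// Timer expiry: true if `pid` was still waiting (the caller would unpark it
    /// with a timeout error); false means the timer entry was stale.
    pub fn on_timeout(&mut self, pid: Pid) -> bool {
        match self.waiters.iter().position(|&w| w == pid) {
            Some(i) => {
                self.waiters.remove(i);
                true
            }
            None => false,
        }
    }
}
```

A timeout that fires after the lock was already granted finds no matching waiter and returns `false`, which is the no-op behaviour for stale timer entries.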
Two background OS threads: an epoll thread (waits on fds with EPOLLONESHOT; on ready, pushes FdReady completion) and a pool thread (runs blocking closures inside catch_unwind; pushes Blocking completion). Both write a wake pipe byte to stir the scheduler. Completions are drained inside schedule_loop.
BinaryHeap<Reverse<Entry>> = min-heap by deadline. Two Reason variants: Sleep (unpark unconditionally) and WaitTimeout (call target.on_timeout()). No cancellation — stale entries are no-ops on pop. Entries inserted by sleep() and mutex::lock_timeout().
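A minimal sketch of the min-heap, assuming deadlines are plain `u64` ticks and PIDs are `u32`s. `Timers`, `drain_expired`, and the `Entry` field names are hypothetical; only `BinaryHeap<Reverse<Entry>>` and the two `Reason` variants come from the description above.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Reason {
    Sleep,
    WaitTimeout,
}

/// Field order matters: the derived `Ord` compares `deadline` first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Entry {
    pub deadline: u64, // hypothetical tick count
    pub pid: u32,      // stand-in for a full Pid
    pub reason: Reason,
}

pub struct Timers {
    // `Reverse` turns std's max-heap into a min-heap by deadline.
    heap: BinaryHeap<Reverse<Entry>>,
}

impl Timers {
    pub fn new() -> Self {
        Timers { heap: BinaryHeap::new() }
    }

    pub fn insert(&mut self, entry: Entry) {
        self.heap.push(Reverse(entry));
    }

    /// Pop every entry whose deadline is at or before `now`, earliest first.
    pub fn drain_expired(&mut self, now: u64) -> Vec<Entry> {
        let mut fired = Vec::new();
        while let Some(&Reverse(next)) = self.heap.peek() {
            if next.deadline > now {
                break;
            }
            self.heap.pop();
            fired.push(next);
        }
        fired
    }
}
```

There is no cancel operation in this model either: an entry that is no longer relevant is simply returned by `drain_expired` and ignored by whoever handles it.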
Thin facade. Exposes spawn, yield_now, park_current, unpark, sleep, block_on_io, wait_readable, wait_writable, run. All delegate to runtime. Also owns JoinHandle and the NoPreempt RAII guard.
Just the Signal enum: Exit(Pid) or Panic(Pid, Box<dyn Any+Send>). No restart logic — that's user-space policy. Signals are delivered via the supervisor actor's own channel (Sender<Signal> stored in the child's slot).
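Since restart logic is left to user space, a supervisor is just an actor that reads signals and decides what to do. The sketch below uses `std::sync::mpsc` as a stand-in for smarm's channel type, and `children_to_restart` is a hypothetical policy function, not part of the crate.

```rust
use std::any::Any;
use std::sync::mpsc;

/// Stand-in for the real Pid(index, generation).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pid(pub u32, pub u32);

/// Shape of the enum as described above.
pub enum Signal {
    Exit(Pid),
    Panic(Pid, Box<dyn Any + Send>),
}

/// Hypothetical user-space policy: collect the children that panicked.
/// The loop ends once every sender has been dropped.
pub fn children_to_restart(signals: mpsc::Receiver<Signal>) -> Vec<Pid> {
    let mut restart = Vec::new();
    for signal in signals {
        match signal {
            Signal::Exit(_) => {} // clean exit: nothing to do
            Signal::Panic(pid, _payload) => restart.push(pid),
        }
    }
    restart
}
```

A real supervisor would respawn each child as its signal arrives rather than collecting a list; the list form only keeps the example small.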
Who Imports What
The critical insight: runtime.rs is the hub. Every substantive module either feeds into it or is orchestrated by it. scheduler.rs is purely a facade — it imports runtime and re-exports it through the public API.
Circular dependency: channel and mutex call scheduler::unpark(), which calls into runtime. And runtime's schedule_loop resumes actors that run channel/mutex code. This is intentional — it's the cooperative unpark mechanism. It works because unpark() never blocks and preemption is disabled while holding any smarm internal lock.
What Happens When You Call run(f)
The walk-through starts from user code calling smarm::run(|| { ... }). The single-threaded run() is a wrapper around runtime::init(Config::exact(1)).run(f).
Install panic hook (once)
A OnceLock guard installs a custom panic hook
that suppresses output inside actor context. Without this, concurrent
actor panics can deadlock Rust's default backtrace printer
(non-reentrant internal lock). The previous hook is chained for panics
outside actors.
Start IoThread (io.rs)
Creates a wake pipe (non-blocking O_NONBLOCK). Creates an epollfd. Creates a shutdown pipe and registers it in the epollfd. Spawns the epoll thread (epoll_wait loop) and the pool thread (blocking-work mpsc receiver). Both share a completion VecDeque behind a mutex.
Install RUNTIME thread-local (runtime.rs)
Arc<RuntimeInner> is cloned into the calling thread's RUNTIME thread-local. This makes with_runtime() work on the calling thread immediately — needed for the next step.
Spawn initial actor (scheduler.rs)
Calls scheduler::spawn(f). This locks SharedState, allocates a slot, creates a Stack via mmap, calls init_actor_stack() to write the initial register frame (trampoline address + 6 zero GPR slots), stores the closure in pending_closures, pushes the PID to the run queue, returns a JoinHandle.
Spawn N-1 OS scheduler threads
For each extra thread: clone Arc<RuntimeInner>, spawn OS thread, set RUNTIME and SCHED_SLOT thread-locals, enter schedule_loop. Thread 0 is the calling thread.
Enter schedule_loop on thread 0 (runtime.rs)
This is a loop { drain → pop → resume → handle-intent }.
Thread 0 blocks here until the run queue is empty and no timers or IO
are pending. All actors run inside this loop. This call does not return
until the program is done.
Shutdown sequence
All scheduler threads return from schedule_loop. OS threads are joined. IoThread::drop() is called: writes shutdown pipe → epoll thread exits; drops the mpsc sender → pool thread exits; closes all fds. SharedState is cleared for potential next run() call.
The Yield → Schedule → Resume Cycle
This is the heartbeat of the entire runtime. Every context switch follows exactly this path, whether triggered by a cooperative yield, preemption, channel recv, mutex contention, or IO wait.
The 6 Yield Sources
| Source | Intent set | Who re-queues | Notes |
|---|---|---|---|
| yield_now() | Yield | Scheduler immediately | Actor stays Runnable; pushed back to queue tail |
| Allocator preemption | Yield | Scheduler immediately | RDTSC check in maybe_preempt() triggers switch_to_scheduler() |
| channel::recv() (empty) | Park | channel::send() → unpark() | Receiver PID stored in channel's parked_receiver |
| mutex::lock() (contended) | Park | MutexGuard::drop() or timer timeout | FIFO waiter queue; timeout via WaitTimeout timer entry |
| sleep(d) | Park | Timer heap → schedule_loop drain | Inserts Reason::Sleep entry; scheduler unparks on pop |
| wait_readable/writable(fd) | Park | epoll thread → completion queue → scheduler | EPOLLONESHOT; one ADD → one wakeup → one DEL per call |
New Actor From First Resume
Spawning is the trickiest part of the runtime. An actor's first
resume is fundamentally different from subsequent ones because we can't
"call" into a new stack — we have to ret into it.
scheduler::spawn(f) called
Allocates a slot from free list or grows the slots vec. Assigns Pid(index, generation). Creates a Stack (64 KB mmap + guard page).
Initial stack frame written (context::init_actor_stack())
Starting from (top & ~15) - 8 (aligned), init_actor_stack() pushes downward: the trampoline function pointer as the ret address, then 6 zero words for the callee-saved registers. The resulting rsp is stored as actor.sp. No actual function call has happened yet.
high addr ← top
top-8: &trampoline ← will be popped by 'ret'
top-16: 0 ← rbx
top-24: 0 ← rbp
top-32: 0 ← r12
top-40: 0 ← r13
top-48: 0 ← r14
top-56: 0 ← r15 ← initial rsp stored here
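The layout above can be reproduced on an ordinary buffer. The sketch models the stack as a slice of 8-byte words whose last element is the highest address; `init_frame` is a hypothetical stand-in for init_actor_stack(), and it skips the alignment arithmetic and the real mmap'd memory.

```rust
/// Number of callee-saved GPRs in the frame: rbx, rbp, r12, r13, r14, r15.
const SAVED_GPRS: usize = 6;

/// Write the initial frame at the top of `stack` and return the word index
/// that plays the role of the initial rsp.
pub fn init_frame(stack: &mut [u64], trampoline_addr: u64) -> usize {
    assert!(stack.len() > SAVED_GPRS, "stack too small for the initial frame");
    let top = stack.len();
    // top-8 bytes: the address that `ret` will pop and jump to.
    stack[top - 1] = trampoline_addr;
    // top-16 .. top-56 bytes: zeros that get popped into the saved registers.
    for word in &mut stack[top - 1 - SAVED_GPRS..top - 1] {
        *word = 0;
    }
    top - 1 - SAVED_GPRS
}
```

The returned index sits 7 words (56 bytes) below the top, matching the position marked as the initial rsp in the diagram.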
Closure stored separately
The closure Box<dyn FnOnce() + Send> goes into SharedState::pending_closures keyed by PID — not
on the actor's stack. This is because we can't pass it via a register
during first resume. The PID is pushed to the run queue; slot state is Runnable.
Scheduler picks up the PID, prepares first resume
Before calling switch_to_actor(), the scheduler pops the closure from pending_closures and writes it to the CURRENT_ACTOR_BOX thread-local. Then sets ACTOR_SP, sets CURRENT_PID, arms the timeslice, enables preemption.
First context switch lands in trampoline()
switch_to_actor() saves the scheduler's GPRs, loads actor.sp as the new rsp, pops the 6 zero words (restoring the "saved" registers to zero), then rets — which pops the trampoline address from the stack and jumps to it. We're now executing on the actor's stack.
trampoline() reads the closure and runs it
Takes the closure from CURRENT_ACTOR_BOX thread-local (consuming it — subsequent resumes skip this). Calls it inside panic::catch_unwind(AssertUnwindSafe(f)). The actor's code runs normally from here. Any yield (channel, mutex, preemption) calls switch_to_scheduler(); the scheduler saves actor state, processes intent, loops.
Actor returns → trampoline handles completion
If catch_unwind returns Ok(()), outcome is Exit. If it returns Err(payload), outcome is Panic(payload). Either way, outcome is written to LAST_OUTCOME thread-local, ACTOR_DONE is set to true, then switch_to_scheduler() is called for the last time. Scheduler sees is_actor_done() == true, calls finalize_actor(): delivers Signal to supervisor, unparks joiners, reclaims slot.
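The completion step maps directly onto `std::panic::catch_unwind`. In this sketch `run_body` and `describe` are hypothetical helpers; the thread-locals, the done flag, and the final switch back to the scheduler are omitted.

```rust
use std::any::Any;
use std::panic::{self, AssertUnwindSafe};

/// Outcome of an actor body, as described above.
pub enum Outcome {
    Exit,
    Panic(Box<dyn Any + Send>),
}

/// Run the closure and convert a panic into a value instead of unwinding further.
pub fn run_body<F: FnOnce()>(f: F) -> Outcome {
    match panic::catch_unwind(AssertUnwindSafe(f)) {
        Ok(()) => Outcome::Exit,
        Err(payload) => Outcome::Panic(payload),
    }
}

/// Render an outcome for logging; only `&str` payloads are decoded here.
pub fn describe(outcome: &Outcome) -> String {
    match outcome {
        Outcome::Exit => "exit".to_string(),
        Outcome::Panic(payload) => match payload.downcast_ref::<&str>() {
            Some(msg) => format!("panic: {msg}"),
            None => "panic: <non-string payload>".to_string(),
        },
    }
}
```

This only works when the crate is built with unwinding panics; with panic = abort the `Err` arm is never reached.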
Allocator-Driven Timeslicing
How it works
The PreemptingAllocator is installed as the process's #[global_allocator]. Its alloc(), alloc_zeroed(), and realloc() all call maybe_preempt() before delegating to the system allocator.
maybe_preempt() decrements a thread-local counter. Every 128 allocations (default), it reads RDTSC. If rdtsc() - timeslice_start > 300_000 cycles (~100µs at 3 GHz) and PREEMPTION_ENABLED == true, it calls switch_to_scheduler().
The check!() macro calls the same maybe_preempt() function — for tight loops that make no allocations.
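The wrapping pattern can be sketched as follows. This is a simplified model, not the PreemptingAllocator itself: `CountingAllocator` is a hypothetical name, it uses atomics where the text describes thread-locals, it only counts how often the periodic check would run instead of reading RDTSC, and it is deliberately not installed as the global allocator.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Check interval; the text gives 128 allocations as the default.
const INTERVAL: usize = 128;

/// Wraps the system allocator and counts how often the periodic check runs.
pub struct CountingAllocator {
    countdown: AtomicUsize,
    checks: AtomicUsize,
}

impl CountingAllocator {
    pub const fn new() -> Self {
        CountingAllocator {
            countdown: AtomicUsize::new(INTERVAL),
            checks: AtomicUsize::new(0),
        }
    }

    /// Number of times the periodic check has fired so far.
    pub fn checks(&self) -> usize {
        self.checks.load(Ordering::Relaxed)
    }

    fn tick(&self) {
        // fetch_sub returns the previous value; 1 means the countdown just hit zero.
        if self.countdown.fetch_sub(1, Ordering::Relaxed) == 1 {
            self.countdown.store(INTERVAL, Ordering::Relaxed);
            // The real runtime would compare timestamps and maybe yield here.
            self.checks.fetch_add(1, Ordering::Relaxed);
        }
    }
}

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        self.tick();
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}
```

Keeping the hot path to a single counter decrement is the point of the countdown: the timestamp read happens only once per interval.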
Invariant: preemption must be off when holding smarm locks
If preemption fired while the scheduler held SharedState, the context switch would try to re-acquire the same mutex → deadlock. smarm prevents this with:
PREEMPTION_ENABLED = false in the scheduler loop before/after switch_to_actor()
with_shared() saves and disables preemption while the mutex is held
NoPreempt RAII guard used in channel/mutex slow paths
trace::record() also disables preemption (it can allocate)
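A save-and-restore guard of this kind can be sketched in a few lines. The thread-local and the guard name follow the text; the constructor name `new` and the `preemption_enabled` helper are assumptions, and the real guard may differ.

```rust
use std::cell::Cell;

thread_local! {
    /// Per-thread flag consulted by the preemption check.
    static PREEMPTION_ENABLED: Cell<bool> = Cell::new(true);
}

/// Read the current flag (hypothetical helper for this sketch).
pub fn preemption_enabled() -> bool {
    PREEMPTION_ENABLED.with(|e| e.get())
}

/// RAII guard: disables preemption on creation, restores the saved value on drop.
pub struct NoPreempt {
    previous: bool,
}

impl NoPreempt {
    pub fn new() -> Self {
        // `replace` stores false and hands back the old value in one step.
        let previous = PREEMPTION_ENABLED.with(|e| e.replace(false));
        NoPreempt { previous }
    }
}

impl Drop for NoPreempt {
    fn drop(&mut self) {
        PREEMPTION_ENABLED.with(|e| e.set(self.previous));
    }
}
```

Restoring the saved value rather than unconditionally writing `true` is what makes nested guards safe: an inner guard dropped inside an outer one leaves preemption disabled.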
Known gap: tight no-alloc loops are invisible without explicit check!() calls. This is documented and by design — such loops are uncommon in message-passing workloads.
// preempt.rs (simplified)
pub fn maybe_preempt() {
    ALLOC_COUNT.with(|c| {
        let n = c.get();
        if n == 0 {
            c.set(ACTIVE_ALLOC_INTERVAL.with(|i| i.get())); // reset counter
            if PREEMPTION_ENABLED.with(|e| e.get()) {
                let elapsed = rdtsc() - TIMESLICE_START.with(|s| s.get());
                if elapsed > ACTIVE_TIMESLICE_CYCLES.with(|i| i.get()) {
                    unsafe { switch_to_scheduler() }; // YieldIntent::Yield
                }
            }
        } else {
            c.set(n - 1);
        }
    });
}
Two Background Threads, One Wake Pipe
epoll_ctl ADD/DEL is called by the scheduler thread directly on the epollfd — this is legal per the epoll_ctl(2) man page even while the epoll thread is inside epoll_wait. Avoids needing a second command channel.
Things That Would Bite You
Between registering as a channel's parked_receiver and calling park_current(), a sender could call unpark(). At that moment the actor is still Runnable, so unpark() sets pending_unpark = true instead of re-queuing. The scheduler checks this flag after the Park yield and re-queues immediately rather than parking. This flag also protects epoll and mutex paths.
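The flag logic can be sketched as two small functions over a slot. This is a model under assumptions: `ActorSlot`, `State`, and `handle_park` are hypothetical names, and each function returns `true` when the caller would push the PID onto the run queue.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Runnable,
    Parked,
}

/// Hypothetical per-actor bookkeeping for this sketch.
#[derive(Debug)]
pub struct ActorSlot {
    pub state: State,
    pub pending_unpark: bool,
}

/// Called by a sender, a guard drop, or a timer.
/// Returns true if the caller should push the PID onto the run queue.
pub fn unpark(slot: &mut ActorSlot) -> bool {
    match slot.state {
        State::Parked => {
            slot.state = State::Runnable;
            true
        }
        State::Runnable => {
            // The actor has not finished parking yet: remember the wakeup.
            slot.pending_unpark = true;
            false
        }
    }
}

/// Called by the scheduler after an actor yields with Park intent.
/// Returns true if the actor is re-queued immediately instead of parked.
pub fn handle_park(slot: &mut ActorSlot) -> bool {
    if slot.pending_unpark {
        slot.pending_unpark = false;
        slot.state = State::Runnable;
        true
    } else {
        slot.state = State::Parked;
        false
    }
}
```

Without the flag, an unpark that arrives in the window before the park is processed would be lost and the actor would sleep with a message already waiting.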
std::thread::sleep inside actor
Blocks the entire OS scheduler thread, starving every actor assigned to that thread. There's no detection. Use smarm::sleep(d) instead.
SharedState
The with_shared() helper disables preemption while the mutex is held. But any code path that allocates inside with_shared and then tries to acquire SharedState again will deadlock. All internal smarm code is carefully structured to avoid this.
All N scheduler threads contend on a single Mutex<SharedState>.
This is the primary scalability ceiling — visible in the benchmark
suite as "tokio-favored" scenarios. Identified, documented, deferred.
The fix would be per-thread deques with work stealing.
When a mutex lock is granted before its timeout, the timer entry stays in the heap. It fires eventually, the callback sees "actor is no longer waiting" and no-ops. Cost is ~32 bytes and a few cycles per stale entry. Bounded by one entry per parked actor.
If an actor dies while waiting on an fd, the epoll registration
is leaked. EPOLLONESHOT bounds damage to one stale wakeup, which the
scheduler drops when it can't find the PID in waiters. Noted in io.rs as a known gap for a future pass.
This is intentional and correct. XMM0–15 are caller-saved in SysV AMD64 ABI. Every yield passes through a Rust call site, so the compiler has already spilled live XMM values to the actor's stack before we get to the naked asm. They're restored when the actor resumes because they're on its own stack.
panic = unwind is required
The trampoline uses catch_unwind to intercept actor panics before they reach the naked assembly shim. If a user sets panic = abort, panics kill the process instead of being caught, and the supervision tree collapses to process death. This is documented and the profile is set in Cargo.toml.